A Span Extraction Approach for Information Extraction on Visually-Rich Documents


Abstract

Information extraction (IE) for visually-rich documents (VRDs) has recently achieved SOTA performance thanks to the adaptation of Transformer-based language models, which shows the great potential of pre-training methods. In this paper, we present a new approach to improve the capability of language model pre-training on VRDs. Firstly, we introduce a new query-based IE model that employs span extraction instead of using the common sequence labeling approach. Secondly, to extend the span extraction formulation, we propose a new training task focusing on modelling the relationships among semantic entities within a document. This task enables target spans to be extracted recursively and can be used to pre-train the model or as an IE downstream task. Evaluation on three datasets of popular business documents (invoices, receipts) shows that our proposed method achieves significant improvements compared to existing models. The method also provides a mechanism for knowledge accumulation from multiple downstream IE tasks.
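
To make the contrast with sequence labeling concrete, here is a minimal sketch of how a span extraction model's output is commonly decoded: instead of assigning a tag to every token, the model scores each token as a possible span start and span end, and the decoder picks the best (start, end) pair. This is a generic illustration, not the authors' implementation; the function name `best_span`, the `max_len` limit, and all scores below are hypothetical.

```python
from typing import List, Tuple


def best_span(start_scores: List[float],
              end_scores: List[float],
              max_len: int = 10) -> Tuple[int, int]:
    """Return the (start, end) token indices with the highest combined score.

    Only spans with start <= end and at most max_len tokens are considered.
    Hypothetical helper for illustration; not taken from the paper.
    """
    if not start_scores or len(start_scores) != len(end_scores):
        raise ValueError("score lists must be non-empty and of equal length")

    best = (0, 0)
    best_score = float("-inf")
    for s, s_score in enumerate(start_scores):
        # The end index may not precede the start or exceed the length limit.
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best


# Made-up tokens and scores for a query such as "invoice number".
tokens = ["Invoice", "No", ":", "INV", "-", "001", "Total", "42.00"]
start = [0.1, 0.0, 0.0, 2.5, 0.1, 0.2, 0.3, 0.4]
end = [0.0, 0.1, 0.0, 0.3, 0.2, 2.8, 0.1, 0.5]

span = best_span(start, end)
answer = tokens[span[0]:span[1] + 1]
```

With these made-up scores, the decoder selects the tokens from "INV" through "001" as the answer to the query. In a real system the scores would come from a trained model conditioned on both the query and the document.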


Similar resources

Towards a Semantic Information Extraction Approach from Unstructured Documents

Recognizing and extracting meaningful information from semi- and unstructured documents, taking into account their semantics, and storing it in a database is an important problem in the context of information access and retrieval. This paper describes a novel logic-based approach to information extraction from both semi- and unstructured documents. The approach, implemented in the HiLεX system, i...


A Knowledge-Based Information Extraction Prototype for Data-Rich Documents in the Information Technology Domain

The Internet is a generous source of information. Semi-structured text documents represent a great part of that information; commercial datasheets from the Information Technology domain are among them (e.g. laptop computer datasheets). However, our capacity to automatically gather and manipulate such information is limited because those documents are designed to be read by people. Many...


Information Extraction Strategies for Thai Documents

The development of an information extraction (IE) system for Thai documents raises a number of issues which are not important for IE in English and other European languages. We describe the characteristics of written Thai and the problem statements, and our approach to the Thai IE system. The structure of written Thai is highly ambiguous, which requires more sophisticated techniques than are ne...


Few-exemplar Information Extraction for Business Documents

The automatic extraction of relevant information from business documents (sender, recipient, date, etc.) is a valuable task in the application domain of document management and archiving. Although current scientific and commercial self-learning solutions for document classification and extraction work pretty well, they still require a high effort of on-site configuration done by domain experts ...


Information extraction for semi-structured documents

The number of unstructured or semi-structured documents produced in all types of organizations continues to increase rapidly. Cost-effective ways of finding the relevant ones and extracting useful information from them are increasingly important to a large number of enterprises for operational and decision-support applications. The approach discussed in this paper constitutes a suitable basis f...



Journal

Journal title: Lecture Notes in Computer Science

Year: 2021

ISSN: 1611-3349, 0302-9743

DOI: https://doi.org/10.1007/978-3-030-86159-9_25